Effects of Different Missing Data Imputation Techniques on the Performance of Undiagnosed Diabetes Risk Prediction Models in a Mixed-Ancestry Population of South Africa
نویسندگان
چکیده
BACKGROUND Imputation techniques used to handle missing data are based on the principle of replacement. It is widely advocated that multiple imputation is superior to other imputation methods, however studies have suggested that simple methods for filling missing data can be just as accurate as complex methods. The objective of this study was to implement a number of simple and more complex imputation methods, and assess the effect of these techniques on the performance of undiagnosed diabetes risk prediction models during external validation. METHODS Data from the Cape Town Bellville-South cohort served as the basis for this study. Imputation methods and models were identified via recent systematic reviews. Models' discrimination was assessed and compared using C-statistic and non-parametric methods, before and after recalibration through simple intercept adjustment. RESULTS The study sample consisted of 1256 individuals, of whom 173 were excluded due to previously diagnosed diabetes. Of the final 1083 individuals, 329 (30.4%) had missing data. Family history had the highest proportion of missing data (25%). Imputation of the outcome, undiagnosed diabetes, was highest in stochastic regression imputation (163 individuals). Overall, deletion resulted in the lowest model performances while simple imputation yielded the highest C-statistic for the Cambridge Diabetes Risk model, Kuwaiti Risk model, Omani Diabetes Risk model and Rotterdam Predictive model. Multiple imputation only yielded the highest C-statistic for the Rotterdam Predictive model, which were matched by simpler imputation methods. CONCLUSIONS Deletion was confirmed as a poor technique for handling missing data. However, despite the emphasized disadvantages of simpler imputation methods, this study showed that implementing these methods results in similar predictive utility for undiagnosed diabetes when compared to multiple imputation.
منابع مشابه
Independent external validation and comparison of prevalent diabetes risk prediction models in a mixed-ancestry population of South Africa
BACKGROUND Guidelines increasingly encourage the use of multivariable risk models to predict the presence of prevalent undiagnosed type 2 diabetes mellitus worldwide. However, no single model can perform well in all settings and available models must be tested before implementation in new populations. We assessed and compared the performance of five prevalent diabetes risk models in mixed-ances...
متن کاملInfluence of Pattern of Missing Data on Performance of Imputation Methods: An Example from National Data on Drug Injection in Prisons
Background Policy makers need models to be able to detect groups at high risk of HIV infection. Incomplete records and dirty data are frequently seen in national data sets. Presence of missing data challenges the practice of model development. Several studies suggested that performance of imputation methods is acceptable when missing rate is moderate. One of the issues which was of less concern...
متن کاملMissing data imputation in multivariable time series data
Multivariate time series data are found in a variety of fields such as bioinformatics, biology, genetics, astronomy, geography and finance. Many time series datasets contain missing data. Multivariate time series missing data imputation is a challenging topic and needs to be carefully considered before learning or predicting time series. Frequent researches have been done on the use of diffe...
متن کاملThe 30-year cardiovascular risk profile of South Africans with diagnosed diabetes, undiagnosed diabetes, pre-diabetes or normoglycaemia: the Bellville, South Africa pilot study
The aim of this pilot study was to assess the 30-year risk for cardiovascular disease (CVD) in the South Africa population of mixed-ancestry in individuals with non-diabetic hyperglycaemia, and undiagnosed and self-reported diabetes. Participants were drawn from an urban community of the Bellville South suburb of Cape Town. In total, 583 subjects without a history of CVD were eligible for lifet...
متن کاملاهمیت خویشاوندی ژنتیکی و رکورد فنوتیپی بر صحت ژنومی دادههای جانهی شبیه سازی شده با استفاده از مدل های حیوانی در حضور اثرات متقابل ژنوتیپ و محیط
The objective of this study was to investigate the role of genetic relationships between training and validation set with considering different ratio of phenotypic records of training set on accuracy of genomic prediction via animal models containing genotype × environment interactions in simulated imputation data. For this purpose, four different scenarios using 15k density containing differen...
متن کامل